Explore and summarize data

Prosper Marketplace is America’s first peer-to-peer lending marketplace, with over $7 billion in funded loans. Here, borrowers can request personal loans and investors can fund anywhere from $2,000 to $35,000 per loan request. I am interested in the borrowers, lenders and Prosper’s service and fee from them. Therefore, this study will be the EDA about data set prosperLoanData.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(forcats)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggthemes)
library(scales)
prosper_raw = read.csv('prosperLoanData.csv')
dim(prosper_raw)
## [1] 113937     81
names(prosper_raw)
##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"

There are 113937 observations and 81 variables in the data set.

Due to many variables in this dataset, I only use 15 variables. Those are:

# Create new dataframe for the 15 variables.
prosper <- select(prosper_raw,'Term','BorrowerAPR','BorrowerRate','LenderYield',
                  'ProsperRating..numeric.', 'ProsperRating..Alpha.',
                  'ListingCategory..numeric.',
                  'Occupation','EmploymentStatus','LoanOriginalAmount',
                  'IncomeRange','CurrentDelinquencies',
                  'LoanOriginationDate','LoanOriginationQuarter','Investors')

# This function will reformat LoanOriginationDate for the date and 
# kepp part (YYYY-MM-DD) only

date_tran <- function(x) {
  return(as.Date(substr(x,1,10)))
  }
prosper$LoanOriginationDate <- date_tran(prosper$LoanOriginationDate)

dim(prosper)
## [1] 113937     15
str(prosper)
## 'data.frame':    113937 obs. of  15 variables:
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield              : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ ProsperRating..numeric.  : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.    : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ CurrentDelinquencies     : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ LoanOriginationDate      : Date, format: "2007-09-12" "2014-03-03" ...
##  $ LoanOriginationQuarter   : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ Investors                : int  258 1 41 158 20 1 1 1 1 1 ...

Univariate Plots Section

In this section, I want to explore the data set regarding to borrowers prosper, the Loan Term they prefer to select, the prosper rating labeled, the borrowers’ occupation and their annual income distribution.

# view which term borrowers prefer to create
ggplot(data = prosper,aes(x = Term)) +
  geom_bar(aes(y=(..count..)/sum(..count..))) +
  scale_y_continuous(labels=percent_format()) +
  xlab('Term (months)') +
  ylab('Percent') 

table(prosper$Term)
## 
##    12    36    60 
##  1614 87778 24545
range(substr(subset(prosper,prosper$Term == '12')$LoanOriginationQuarter,1,4))
## [1] "Q1 2" "Q4 2"

There are almost 80% of the borrowers more likely to choose 3-year fixed loan term while they creating the listing. There are 1614 listing with fixed Term 1 year, I am not sure if this part of data is correct or not? as from prosper.com, that only support Fixed term—3 or 5 years.

# View the occupation distributions of borrowers
prosper <- within(prosper, 
                   Occupation <- factor(Occupation, 
                                      levels=names(sort(table(Occupation), 
                                                        decreasing=TRUE))))
ggplot(data = prosper, aes(x = Occupation)) +
  geom_bar() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

If we don’t consider the “Other” case, Professional has the most top count and almost three times as many as the second. The count of borrowers on the top three is professional, Computer Programer, Executive, at the bottom of the list are the students, generally, the students don’t need to borrow money, or they don’t have strong credit rating to create loan request from prosper.

# View ths Prosper Rating distribution of borrowers
ggplot(data = prosper,aes(x = factor(ProsperRating..Alpha.,
                      levels = c("AA","A","B","C","D","E","HR")))) +
  geom_bar() +
  xlab('Prosper Rating')

table(prosper$ProsperRating..Alpha.)
## 
##           A    AA     B     C     D     E    HR 
## 29084 14551  5372 15581 18345 14274  9795  6935

Prosper Ratings, from lowest-risk to highest-risk, are labeled AA, A, B, C, D, E, and HR (“High Risk”), we can see 29084 prosper rating is labeled “NA”,this is understandable since the prosper rating is provided since 2009 after SEC registration, there is no rating labeled to the borrowers before 2009. If we not consider the “NA” case, the prosper rating appears Binomial distribution, 57% of borrowers are labeled middle risk B,C and D. Borrowers with rating C with 22% the highest frequency.

# View ths income range distribution of borrowers
ggplot(data = prosper, aes(x = IncomeRange)) + 
  geom_bar() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

table(prosper$IncomeRange)
## 
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

The borrower’s income ranges appears Binomial distribution. most of borrowers’ annual income are fall into $25,000 - 75,000. and there are 621 borrowers with zero incomes they got loans, it’s default value if prosper not get the source of borrower’s incomes, or still the borrowers can get loans if they have good credit rating from other path.

# For those borrowers with zero incomes, 
# will study if the employment status while they creating loan request?
ggplot(data = subset(prosper,IncomeRange == '$0'),
       aes(x = EmploymentStatus)) +
  geom_bar()

dim(subset(prosper,IncomeRange == '$0'& EmploymentStatus == 'Not employed'))
## [1] 11 15

Further exploring the borrowers whose incomes are zero, find 11 borrowers are Not employed, they may not have incomes? all the loan requests were created on Q3 2007. I guess it should be data issue on the person with zero annual income.

prosper <- within(prosper, 
                   ListingCategory..numeric. <- 
                    factor(ListingCategory..numeric.,
                           levels=names(sort(table(ListingCategory..numeric.), 
                                             decreasing=TRUE))))

ggplot(data = prosper,aes(x = ListingCategory..numeric.)) +
  geom_bar() + 
  xlab('Ordinary list category')

table(prosper$ListingCategory..numeric.)
## 
##     1     0     7     2     3     6     4    13    15    18    14    20 
## 58308 16965 10494  7433  7189  2572  2395  1996  1522   885   876   771 
##    19     5    16    11     8    10     9    12    17 
##   768   756   304   217   199    91    85    59    52

Category: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans.

Out of the listing category, three categories comes out as the highest past 10000 loans. There’s N/A and Other categories, so we can’t know for sure the specific category. But one comes out highest which is category 1 - Debt Consolidation, where’s one take out a loan to pay many others. This comes really high with 58308 loans,it could be that many Prosper visitors comes with already have loans, and want to search some loans to pay for it?

Univariate Analysis

There are 113937 observations and 81 variables in the raw data set. Due to many variables in this dataset, I only use 18 variables.

Which Term loan the borrowers prefer to select?

There are almost 80% of the borrowers prefer to select 3-year fixed loan while they creating the list. What’s the reason leading borrowers select 3-year,is it related to BorrowerRate? Is longer term caused higher BorrowerRate? I will explore in the following section later.

what people need money?

If we don’t consider the “Other” case, Professional has the most top count and almost three times as many as the second. The count of borrowers on the top three are Professinal, Computer Programer and Executive, at the bottom of the list are the students.

what prosper rating labeled to borrowers?

For the prosper rating labeled, the prosper rating appears Binomial distribution, 57% of borrowers are labeled middle risk B,C and D. Borrowers with rating C with 22% the highest frequency.

what’s the borrowers’ annual incomes?

The borrowers’ income ranges appears Binomial distribution. 78% of borrowers’ annual income are fall into $25,000 - 75,000. and there are 621 borrowers with zero incomes they got loans, further exploring the borrowers whose incomes are zero, only find 11 borrowers are Not employed while they creating list on prosper, I guess it should be data issue on the person with zero annual income.

Why people need money?

Out of the listing category, three categories comes out as the highest past 10000 loans. There’s N/A and Other categories, so we can’t know for sure the specific category. But one comes out highest which is category 1 - Debt Consolidation, where’s one take out a loan to pay many others. This comes really high with 58308 loans, it could be that many Prosper visitors comes with already have loans, and want to search some loans to pay for it?

Bivariate Plots Section

#Plot Prosper rating vs. BorrowerRate

prosper$ProsperRating..Alpha. <- 
  factor(prosper$ProsperRating..Alpha., 
         levels = c("AA","A","B","C","D","E","HR"))

ggplot(data = subset(prosper,ProsperRating..Alpha. != ''),
       aes(x = ProsperRating..Alpha.,y = BorrowerRate),group = 1) +
  geom_point(alpha = 0.1, position = position_jitter(h = 0,width = 0.5),color = I('#a0522d'))+
  ylim(0,quantile(prosper$BorrowerRate,0.95)) 
## Warning: Removed 3612 rows containing missing values (geom_point).

with(prosper,cor.test(as.numeric(ProsperRating..Alpha.),BorrowerRate))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(ProsperRating..Alpha.) and BorrowerRate
## t = 917.37, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9524846 0.9537172
## sample estimates:
##       cor 
## 0.9531049

AA LOWEST risk-> HR HIGHEST risk

We see that the higher risk labeled to borrowers,the higher borrower rate they would cost. And we see that it has a strong positive linear relationship. This is can be explained as the interest rate is the one where the loaner has benefit from the money that he/she loans.

# plot boxplot for prosper rating and loanOriginalAmount

p1 = ggplot(data = prosper,aes(x = ProsperRating..Alpha., y = LoanOriginalAmount)) +
  geom_boxplot() +
  scale_y_continuous(breaks = seq(0,36000,5000))

p2 = ggplot(data = prosper,aes(x = ProsperRating..Alpha., y = LoanOriginalAmount)) +
  geom_boxplot() +
  scale_y_continuous(breaks = seq(0,36000,5000)) +
  coord_cartesian(ylim = c(3000,15500))

grid.arrange(p1,p2,ncol = 2)

with(prosper,cor.test(as.numeric(ProsperRating..Alpha.),LoanOriginalAmount))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(ProsperRating..Alpha.) and LoanOriginalAmount
## t = -138.17, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4340341 -0.4230486
## sample estimates:
##        cor 
## -0.4285572

It shows rating E and HR has smaller IQR compared to the rest of the rating, rating B and C has no outliner, rating A,B has a smaller outliner, rating AA has two outline only. With a correlation of -0.43, it shows a negative moderately strong linear relationship, as the risk rating increase, the lower amount would be borrowed from prosper.

# plot investor vs. borrower's risk rating

ggplot(data = subset(prosper,strtrim(ProsperRating..Alpha.,
                       width = 1) != ""),
                     aes(x = ProsperRating..Alpha.,y = Investors)) +
  geom_bar(stat = 'identity') +
  xlab('Prosper Rating')

The distribution has a bell shape and is right-skewed, we can see that investors prefer to loan to those borrowers holding lower risk rating. This is understandable since loan to those person with higher risk may be higher loss the money, even though the lenders would have higher yield by lending to higher risk case (see the next plot).

# plot bar for ProsperRating vs lenderYield
ggplot(data = subset(prosper,strtrim(ProsperRating..Alpha.,width = 1) != ""), 
       aes(x = ProsperRating..Alpha.,
           y = LenderYield)) +
  #stat_summary(fun.y = 'mean',geom = 'bar') +
  geom_point(position = position_jitter(h = 0,width = 1)) +
  xlab('Prosper Rating') +
  ylim(0,quantile(prosper$LenderYield,0.95)) 
## Warning: Removed 3611 rows containing missing values (geom_point).

with(prosper,cor.test(as.numeric(ProsperRating..Alpha.),LenderYield))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(ProsperRating..Alpha.) and LenderYield
## t = 917.52, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9524993 0.9537315
## sample estimates:
##       cor 
## 0.9531194

We see that it has a strong positive linear relationship between risk rating and lenderYield, that means that if the lenders want to have higher yield, they would take on more risk of lose rate.

# plot LenderYield vs. LoanOriginationDate

ggplot(data = prosper,aes(x = LoanOriginationDate,y = LenderYield)) +
  geom_line(stat = 'summary', fun.y = 'mean') +
  scale_x_date(date_breaks = '6 months') +
  theme(axis.text.x=element_text(angle=45,hjust=1,vjust=0.5)) +
  xlab('') + 
  ylab('Lender Yield') +
  scale_y_log10()
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 22 rows containing non-finite values (stat_summary).

The LeanderYield quickly increase in the initial 1-year, then in a stable line after 2006. But the thing that surprised me is there was no LenderYield from Q3 2008 to Q2 2009. Then I searched the following from wikipedia.

On November 24, 2008, the SEC found Prosper to be in violation of the Securities Act of 1933. As a result of these findings, the SEC imposed a cease and desist order on Prosper. Due primarily to the novel nature of the peer-to-peer lending models, the SEC, after review, now treats all peer-to-peer lending transactions as sales of securities and requires that all platforms register with the SEC.

In July 2009, Prosper reopened their website for lending (“investing”) and borrowing after having obtained SEC registration for its loans. Hence, that can explain why there were a continuous gap period on lender yield.

Bivariate Analysis

There is a continuous gap period Prosper was not allowed to run their business until it had obtained SEC registration for its loans. The lender yield is stale in the years.

What’s the impact after risk rating provided to borrowers?

We observed strong positive/negative linear correlation relationship between prosper risk rating and borrower rate, loan amount can request and lenderYield.

Borrowers holds lower risk rating (AA highest -> HR lowest) in prosper would 1. enjoy lower interest rates. 2. borrow more limit amount. 3. more popular with investors.

Lenders would have higher yield, if they are willing to take on higher risk of lose rate.

Multivariate Plots Section

# plot prosoer rating vs. borrowerRate and color by Term

ggplot(data = subset(prosper,ProsperRating..Alpha. != ''),
       aes(x = ProsperRating..Alpha.,y = BorrowerRate),group = 1) +
  geom_point(alpha = 0.1, position = position_jitter(h = 0,width = 0.5),aes(color = Term))+
  ylim(0,quantile(prosper$BorrowerRate,0.95)) 
## Warning: Removed 3612 rows containing missing values (geom_point).

In the previous section we explored there is a strong positive linear correlation relationship between prosper rating and interest rate. here, by adding one more variable Term, we see that for each rating, AA to D, looks the longer loan Term would cause the higher interest rate.

ggplot(data = subset(prosper,ProsperRating..Alpha. != ''),
       aes(x = LoanOriginalAmount, y = BorrowerRate)) + 
  geom_point(aes(color = ProsperRating..Alpha.)) +
  ylim(0,quantile(prosper$BorrowerRate,0.95))
## Warning: Removed 3612 rows containing missing values (geom_point).

with(prosper,cor.test(LoanOriginalAmount,BorrowerRate))
## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and BorrowerRate
## t = -117.58, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3341283 -0.3237719
## sample estimates:
##        cor 
## -0.3289599

This scatter plot shows the relationship of Loan amount, borrow rate and prosper rating. Rating AA,A and B looks can loan amount up to $35000, rating C less than $25000, rating D less than $15000, rating E less than $10000, and the rating highest risk rating HR less than $5000, seeing the plot on y-axis, the interest rate is increasing from risk rating AA to HR.

ggplot(data = subset(prosper,strtrim(ProsperRating..Alpha.,
                       width = 1) != ""),
                     aes(x = round(LoanOriginalAmount/1000)*1000,y = Investors)) +
  geom_line(aes(color = ProsperRating..Alpha.),size = 1,
            stat = 'summary', fun.y = 'mean') +
  xlab('LoanOriginalAmount')

We can see investors are more likely to invest in low risk request, not the amount of the loan, the investors on AA rating looks obviously higher than others.

ggplot(data = prosper,aes(x = LoanOriginationDate,
                          y = LoanOriginalAmount)) +
  geom_point(aes(color = ProsperRating..Alpha.))

by(prosper$LoanOriginalAmount,substr(prosper$LoanOriginationQuarter,1,4),max)
## substr(prosper$LoanOriginationQuarter, 1, 4): Q1 2
## [1] 35000
## -------------------------------------------------------- 
## substr(prosper$LoanOriginationQuarter, 1, 4): Q2 2
## [1] 35000
## -------------------------------------------------------- 
## substr(prosper$LoanOriginationQuarter, 1, 4): Q3 2
## [1] 35000
## -------------------------------------------------------- 
## substr(prosper$LoanOriginationQuarter, 1, 4): Q4 2
## [1] 35000
by(prosper$LoanOriginalAmount,prosper$ProsperRating..Alpha.,max)
## prosper$ProsperRating..Alpha.: AA
## [1] 35000
## -------------------------------------------------------- 
## prosper$ProsperRating..Alpha.: A
## [1] 35000
## -------------------------------------------------------- 
## prosper$ProsperRating..Alpha.: B
## [1] 35000
## -------------------------------------------------------- 
## prosper$ProsperRating..Alpha.: C
## [1] 25000
## -------------------------------------------------------- 
## prosper$ProsperRating..Alpha.: D
## [1] 15000
## -------------------------------------------------------- 
## prosper$ProsperRating..Alpha.: E
## [1] 15900
## -------------------------------------------------------- 
## prosper$ProsperRating..Alpha.: HR
## [1] 16800

This plot shows that the maximum loan amount in prosper website is $15000 in initial year 2015, $25000 before Q2 2013 and $35000 after that. There should be no business run from Q3 2008 to Q2 2009 on prosper, after colored by risk rating, we also found there is no risk rating provided to borrowers, all of them are NA. The risk rating is implemented after Prosper reopened from Q2 2009. The rating D, E, HR can borrow maximum amount around $15000, rating C is up to $25000, all rating AA, A, B are up to maximum loan amount $35000.

ggplot(data = subset(prosper,ProsperRating..Alpha. != "" &
                       CurrentDelinquencies > 0), 
       aes(x = LoanOriginalAmount, y = CurrentDelinquencies)) +
  geom_point(aes(color = ProsperRating..Alpha.)) +
  scale_x_continuous(breaks = seq(0,35000,5000)) +
  ylim(0,30) 
## Warning: Removed 2 rows containing missing values (geom_point).

We can see plot are left skewed. it shows that rating E,HR borrowed less amount, but they have more delinquencies. while the credit rating increasing, the fewer number of delinquencies come out, even though the borrowers labeled lower risk borrowed more money.

Multivariate Analysis

As the loan term and risk rating increase, the interest rate will increase also, there is a strong positive linear correlation relationship between prosper rating and interest rate. if we see the loan amount and risk rating, we found the higher risk rating lead borrowers have to pay more interest rate, and borrowers can have less loan amount also.

Investors are also more likely to invest their money to those borrowers they have lower risk and good credit rating, because they don’t want to lost their money, but still can have yield from this.

Further study the current delinquencies of borrowers , that also explains that high-risk borrowers generally have a higher default rate.

Final Plots and Summary

Plot One

ggplot(data = prosper,aes(x = factor(ProsperRating..Alpha.,
                      levels = c("AA","A","B","C","D","E","HR")))) +
  geom_bar() +
  xlab('Prosper Rating') +
  ggtitle("Prosper rating distribution of created listing") +
  theme(plot.title = element_text(hjust = 0.5))

There are almost 30000 loan listing created before Q2 2009 are not have rating provided. The risk rating from the highest AA to the lowers HR appears Binomial distribution, 57% of borrowers are labeled middle risk B,C and D. Borrowers with rating C with 22% the highest frequency.

Description One

Plot Two

ggplot(data = prosper,aes(x = ProsperRating..Alpha., y = LoanOriginalAmount)) +
  geom_boxplot() +
  scale_y_continuous(breaks = seq(0,36000,5000)) +
  coord_cartesian(ylim = c(3000,15500)) +
  ggtitle("Prosper rating by loan amount") +
  theme(plot.title = element_text(hjust = 0.5))

Description Two

The rating E and HR has smaller IQR compared to the rest of the rating, rating B and C has no outliner, rating A,B has a smaller outliner, rating AA has two outline only. With a correlation of -0.43, it shows a negative moderately strong linear relationship, as the risk rating increase, the lower amount would be borrowed from prosper.

Plot Three

ggplot(data = subset(prosper,ProsperRating..Alpha. != ''),
       aes(x = ProsperRating..Alpha.,y = BorrowerRate),group = 1) +
  geom_point(alpha = 0.1, position = position_jitter(h = 0,width = 0.5),aes(color = Term))+
  ylim(0,quantile(prosper$BorrowerRate,0.95)) +
  ggtitle("Borrower Rate of prosper rating for Term") +
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 3612 rows containing missing values (geom_point).

Description Three

There is a strong positive linear correlation relationship 0.95 between prosper rating and interest rate. The fixed (3 0r 5 years) loan Term which borrowers select actually causes different borrower rate on each risk rating.

Reflection

The dataset is from the Prosper loans, it’s America’s first peer-to-peer lending marketplace, where people could loans money by listing it in the website. There are 113937 observations and 81 variables in the dataset.

I extract 15 features and explore loan term distribution, the borrowers’ Occupation, rating, annual incomes, employmentStatus, and what purpose they loan from Prosper. There almost 80% of the borrowers selected 3 years loan term. I am surprise that there are 1614 listing was created from 2010 to 2013 with 1 year loan term, prosper.com doesn’t provide such term for loan. Not sure if prosper opened for this term or what problems caused this data issue. While we seeing Occupation of borrowers, the top 1 is Other, it’s understandable because all not in the website selecting list can be other, the second to third are professional and computer programer. The prosper rating are not provided before Q2 2009, Ratings, from lowest-risk to highest-risk, are labeled AA, A, B, C, D, E, and HR (“High Risk”), the distributions of rating follow an almost unimodal distribution, with the peak of 18345. The borrowers incomes range distributions also looks follow an almost unimodal distribution, we noticed that there are 621 loan list with zero annual income, then I further explored their employment status when they creating list, most of them are in employed status, hence, those zero incomes should be data issue. Finally, I explore the purpose of borrowers loan from prosper, found the highest category is 1 - Debt Consolidation, where’s one take out a loan to pay many others. we may guess it could be that many Prosper visitors comes with already have loans, and want to search some loans to pay for it.

Prosper rating is a very important feature, looking at the statistics and plot, Prosper Rating is strongly correlated with borrower’s interest rate, the number of investors, and the LenderYield. It can be explained that this lower risk rating the borrowers labeled, the lower borrower’s interest rate would have, and will attract more investors also. For lenders, if they would like to fund for those borrowers labeled higher risk rating, the higher yield will have, but on other hand, it will increase the money lost rate.

Reference

Udacity: https://classroom.udacity.com/nanodegrees/nd002-cn-advanced/parts/7f46cd58-8041-4d9d-88a5-4b7c6f7be63e
wikipedia: https://en.wikipedia.org/wiki/Prosper_Marketplace
ggplot: http://ggplot2.org/
Variable Definitions: https://rstudio-pubs-static.s3.amazonaws.com/86324_ab1e2e2fa210452f80a1c6a1476d7a2a.html